Towards Global Optimal Visual In-Context Learning Prompt Selection
Visual In-Context Learning (VICL) is a prevailing way to transfer visual foundation models to new tasks by leveraging the contextual information contained in in-context examples to enhance learning and prediction for a query sample. The fundamental problem in VICL is how to select the best prompt to activate the model's power as much as possible, which is equivalent to a ranking problem: testing the in-context behavior of each candidate in the alternative set and selecting the best one. To use a more appropriate ranking metric and leverage more comprehensive information within the alternative set, we propose a novel in-context example selection framework that approximately identifies the globally optimal prompt, i.e., chooses the best-performing in-context examples from all alternatives for each query sample. Our method, dubbed Partial2Global, adopts a transformer-based list-wise ranker to provide a more comprehensive comparison among several alternatives, and a consistency-aware ranking aggregator to generate a globally consistent ranking. The effectiveness of Partial2Global is validated through experiments on foreground segmentation, single object detection and image colorization, demonstrating that Partial2Global consistently selects better in-context examples than other methods and thus establishes a new state of the art.
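The partial-to-global idea in the abstract can be sketched as follows. This is a minimal stand-in, not the paper's method: the transformer list-wise ranker is replaced by pre-computed partial rankings over subsets of candidates, and a simple Borda-style score stands in for the consistency-aware aggregator (the function name and scoring scheme are assumptions for illustration).

```python
from collections import defaultdict

def aggregate_partial_rankings(partial_rankings, num_candidates):
    """Aggregate list-wise partial rankings into one global ranking.

    Each partial ranking is a list of candidate indices ordered
    best-to-worst over a subset of the alternative set. A Borda-style
    score rewards candidates placed ahead of others; the top of the
    final order approximates the globally best in-context example.
    """
    scores = defaultdict(float)
    for ranking in partial_rankings:
        n = len(ranking)
        for pos, cand in enumerate(ranking):
            # Earlier positions in a partial list earn more points,
            # normalized so lists of different lengths are comparable.
            scores[cand] += (n - 1 - pos) / max(n - 1, 1)
    # Candidates never compared default to a score of 0.
    return sorted(range(num_candidates), key=lambda c: scores[c], reverse=True)

# Three overlapping partial rankings over 5 candidate examples.
partials = [[2, 0, 4], [2, 1, 3], [0, 1, 4]]
global_order = aggregate_partial_rankings(partials, 5)
```

Here candidate 2 wins two partial comparisons outright, so it heads the aggregated global order even though no single list compares all five candidates.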
Make LVLMs Focus: Context-Aware Attention Modulation for Better Multimodal In-Context Learning
Li, Yanshu, Yang, Jianjiang, Yang, Ziteng, Li, Bozheng, Han, Ligong, He, Hongyang, Yao, Zhengtao, Chen, Yingjie Victor, Fei, Songlin, Liu, Dongfang, Tang, Ruixiang
Multimodal in-context learning (ICL) is becoming a key capability that allows large vision-language models (LVLMs) to adapt to novel tasks without parameter updates, which expands their usefulness in many real-world applications. However, ICL performance remains unstable even when the in-context demonstrations (ICDs) are well matched, showing that LVLMs still struggle to make full use of the provided context. While existing work mainly focuses on prompt engineering or post-hoc logit calibration, we study the attention mechanisms inside LVLMs to address their inherent limitations. We identify two important weaknesses in their self-attention that hinder effective ICL. To address these weaknesses, we propose Context-Aware Modulated Attention (CAMA), a training-free and plug-and-play method that dynamically adjusts attention logits based on the input in-context sequence. CAMA uses a two-stage modulation process that strengthens attention to semantically important tokens, especially visual ones. Across four LVLMs and seven benchmarks, CAMA consistently outperforms vanilla models and baselines, showing clear effectiveness and generalization. It can also activate the intended benefits of prompt engineering methods and remains robust across different sequence configurations. Therefore, CAMA opens up new directions for improving multimodal reasoning through a deeper understanding of attention dynamics.
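The core operation the abstract describes, adjusting attention logits before the softmax to favor semantically important tokens, can be sketched as below. This is an illustrative single-head sketch, not CAMA itself: the paper's two-stage modulation is collapsed into one additive bias, and the `boost` weight and mask are assumptions.

```python
import numpy as np

def modulated_attention(logits, boost_mask, boost=1.0):
    """Training-free attention-modulation sketch.

    `logits` is a (queries, keys) array of raw attention logits;
    `boost_mask` flags key positions deemed semantically important
    (e.g. visual tokens). An additive bias raises their logits
    before the usual softmax over keys.
    """
    adjusted = logits + boost * boost_mask[None, :]
    # Numerically stable softmax over the key axis.
    adjusted = adjusted - adjusted.max(axis=-1, keepdims=True)
    weights = np.exp(adjusted)
    return weights / weights.sum(axis=-1, keepdims=True)

logits = np.zeros((2, 4))                      # two queries, four keys
mask = np.array([0.0, 1.0, 1.0, 0.0])          # pretend keys 1-2 are visual tokens
attn = modulated_attention(logits, mask, boost=2.0)
```

Because the bias is applied inside the softmax, attention mass shifts toward the flagged tokens while each row still sums to one, so the adjustment is plug-and-play for a frozen model.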
Common comments
We thank the reviewers for their positive and constructive feedback on this work. We address the comments as follows. Is the method robust to different values of K? A large K would make the class center too dependent on the additional data; Eq. (6) defines K based on our experiments. We will further elaborate on this mechanism in the revision, following the reviewers' comments.
TACO: Enhancing Multimodal In-context Learning via Task Mapping-Guided Sequence Configuration
Li, Yanshu, Yang, Jianjiang, Yun, Tian, Feng, Pinyuan, Huang, Jinfa, Tang, Ruixiang
Multimodal in-context learning (ICL) has emerged as a key mechanism for harnessing the capabilities of large vision-language models (LVLMs). However, its effectiveness remains highly sensitive to the quality of input ICL sequences, particularly for tasks involving complex reasoning or open-ended generation. A major limitation is our limited understanding of how LVLMs actually exploit these sequences during inference. To bridge this gap, we systematically interpret multimodal ICL through the lens of task mapping, which reveals how local and global relationships within and among demonstrations guide model reasoning. Building on this insight, we present TACO, a lightweight transformer-based model equipped with task-aware attention that dynamically configures ICL sequences. By injecting task-mapping signals into the autoregressive decoding process, TACO creates a bidirectional synergy between sequence construction and task reasoning. Experiments on five LVLMs and nine datasets demonstrate that TACO consistently surpasses baselines across diverse ICL tasks. These results position task mapping as a novel and valuable perspective for interpreting and improving multimodal ICL.
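The sequence-configuration step the abstract attributes to TACO can be approximated by hand as follows. TACO itself is a learned transformer scorer with task-aware attention; this greedy stand-in only illustrates the idea of configuring an ICD sequence by trading off relevance to the query against redundancy among picks (the embeddings, `k`, and the 0.5 trade-off weight are all assumptions).

```python
import numpy as np

def configure_sequence(query_emb, cand_embs, k=3):
    """Greedy ICL sequence-configuration sketch.

    Picks k demonstration indices one at a time, scoring each
    remaining candidate by dot-product relevance to the query minus
    a penalty for similarity to demonstrations already chosen.
    """
    chosen = []
    remaining = list(range(len(cand_embs)))
    while remaining and len(chosen) < k:
        def score(i):
            rel = float(cand_embs[i] @ query_emb)           # relevance to query
            red = max((float(cand_embs[i] @ cand_embs[j])   # redundancy with picks
                       for j in chosen), default=0.0)
            return rel - 0.5 * red
        best = max(remaining, key=score)
        chosen.append(best)
        remaining.remove(best)
    return chosen

query = np.array([1.0, 0.0])
cands = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
order = configure_sequence(query, cands, k=2)
```

With two identical relevant candidates and one irrelevant one, the redundancy penalty is what keeps the sketch from simply ranking by relevance alone; a learned scorer like TACO replaces this hand-tuned trade-off with task-mapping signals.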